San Francisco Police Incidents

The city of San Francisco is known not only as a commercial and financial center of Northern California, but also, increasingly, for having one of the highest crime rates in the country. The San Francisco Police Department (SFPD) is the city's police department and serves an estimated population of 1.2 million. The SFPD has frequently been criticized for the large number of cases that remain unsolved every year. For this reason, the SFPD is determined to build trust, engage with the San Francisco community, and drive positive outcomes in public safety. In an effort to be as transparent as possible about the department and its operations, the SFPD has shared various data sets. This project analyzes the Police Department incidents data. Also of interest is the effect of crime rates on real estate prices. To aid this inference, a secondary dataset of house prices was downloaded from Zillow. Both datasets were available as CSV files.

The Police Department incidents dataset is imported into R and named incidents. It is derived from the SFPD Crime Incident Reporting System and lists crime entries from 01/01/2003 through 12/31/2017. It has around 2.18 million incident entries. The dataset has the following 13 features.

  1. IncidntNum
  2. Category
  3. Descript
  4. DayOfWeek
  5. Date
  6. Time
  7. PdDistrict
  8. Resolution
  9. Address
  10. X
  11. Y
  12. Location
  13. PdId

The columns IncidntNum and PdId are identification numbers assigned to each recorded incident. The column Category contains 39 different categories of incidents committed across San Francisco. A detailed description of each incident is provided in the column Descript. The date, day and time at which an incident occurred are given in the columns Date, DayOfWeek and Time respectively. The police district that handled the incident is in PdDistrict and the resolution it provided is in Resolution. There are 10 distinct police districts across San Francisco. The address where the incident occurred is given in Address, and its exact longitude and latitude are given in X and Y respectively. Location is a string, (Y, X), containing the latitude and longitude. A glimpse of the dataset is given below.

## Observations: 2,175,688
## Variables: 13
## $ IncidntNum <int> 150060275, 150098210, 150098210, 150098210, 1500982...
## $ Category   <chr> "NON-CRIMINAL", "ROBBERY", "ASSAULT", "SECONDARY CO...
## $ Descript   <chr> "LOST PROPERTY", "ROBBERY, BODILY FORCE", "AGGRAVAT...
## $ DayOfWeek  <chr> "Monday", "Sunday", "Sunday", "Sunday", "Tuesday", ...
## $ Date       <chr> "01/19/2015", "02/01/2015", "02/01/2015", "02/01/20...
## $ Time       <time> 14:00:00, 15:45:00, 15:45:00, 15:45:00, 19:00:00, ...
## $ PdDistrict <chr> "MISSION", "TENDERLOIN", "TENDERLOIN", "TENDERLOIN"...
## $ Resolution <chr> "NONE", "NONE", "NONE", "NONE", "NONE", "NONE", "NO...
## $ Address    <chr> "18TH ST / VALENCIA ST", "300 Block of LEAVENWORTH ...
## $ X          <dbl> -122.4216, -122.4144, -122.4144, -122.4144, -122.43...
## $ Y          <dbl> 37.76170, 37.78419, 37.78419, 37.78419, 37.80047, 3...
## $ Location   <chr> "(37.7617007179518, -122.42158168137)", "(37.784190...
## $ PdId       <dbl> 1.500603e+13, 1.500982e+13, 1.500982e+13, 1.500982e...

The secondary dataset from Zillow is also imported into R and named house_prices. It has ~15000 records of median price per square foot for regions across the US. Its features include RegionID, RegionName, City, State, Metro, CountyName, SizeRank and 261 monthly columns of price per sqft covering 1996 to 2017. From this dataset, only the median house price per sqft for San Francisco is required.

Problem Statements

The two main goals of this capstone are:

  1. To predict whether an incident will be resolved by the police department, given its category. This will indicate the efficacy of the police department in resolving crimes in a given district. Such information is of prime interest to multiple parties. For example, the police can develop specific strategies to reduce crime types that otherwise prove difficult to resolve. The administration of a given district can introduce stricter laws and protocols to deter crimes with low resolution rates (thereby decreasing the frequency of these crimes). The local population would be most interested in understanding which types of crime require vigilance.

  2. To infer whether crime rates affect property prices in San Francisco, that is, to see whether areas with low crime rates enjoy higher property values. This is of interest to realtors, sellers and buyers of property. Buyers would use this information to assess the safety of the neighborhood around a target property. Sellers would use it to set the resale price of their property for maximum profit.

Data Wrangling

Data wrangling involves identifying the variables that have an effect on the categories of crime. This includes creating new variables such as Year, Month, Date, etc. and deleting variables that have no effect on the analysis, such as Incident Number, PdId, etc. It also involves identifying missing or outlier values (if any) and replacing or deleting them appropriately. The Zillow dataset must be combined with the SFPD dataset to yield the property price corresponding to each crime location. A more detailed explanation follows.

Incidents Dataset

The dataset is examined for dimensions, column names, structure and summary statistics. The IncidntNum and PdId columns are removed as they contain only identifiers for the registered incidents and hence are of little use for the analysis. All other columns are retained. The Date column is separated into Month, DayOfMonth and Year columns, which are converted to a numeric data type. The columns X and Y are renamed to the more descriptive longitude and latitude respectively.
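
The date handling described above can be illustrated with a short sketch. It is written in Python purely for illustration (the project itself was carried out in R), and the function name split_date is hypothetical; only the MM/DD/YYYY format is taken from the data glimpse.

```python
from datetime import datetime

def split_date(date_str):
    """Split an 'MM/DD/YYYY' string into numeric month, day of month and year."""
    parsed = datetime.strptime(date_str, "%m/%d/%Y")
    return {"Month": parsed.month, "DayOfMonth": parsed.day, "Year": parsed.year}

# A Date value taken from the glimpse of the incidents data
parts = split_date("01/19/2015")
print(parts)
```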

The data is then checked for missing values. Only 1 row is found to have a missing PdDistrict value. This observation is stored separately in a dataframe called mis_incidents, and the remaining complete cases are stored in the incidents dataframe. The missing value is imputed as follows. The Address column in incidents is not unique, i.e., the same address appears in several rows. Therefore, the Address from mis_incidents is matched to Address in the incidents dataframe, and the PdDistrict recorded for that address is filled into the missing value. The Address in mis_incidents, 100 Block of VELASCO AV, was found to belong to the Ingleside police district. This observation is then added back to the incidents dataframe.
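
A minimal sketch of this address-based imputation is given here, in Python for illustration only (the project used R). The helper impute_district and the toy rows are hypothetical; only the pairing of 100 Block of VELASCO AV with the Ingleside district comes from the text.

```python
def impute_district(missing_row, complete_rows):
    """Fill PdDistrict from a complete row that has the same Address."""
    for row in complete_rows:
        if row["Address"] == missing_row["Address"]:
            return {**missing_row, "PdDistrict": row["PdDistrict"]}
    return missing_row  # no matching address: leave the value missing

# Toy stand-ins for the complete cases and the single incomplete row
incidents = [
    {"Address": "18TH ST / VALENCIA ST", "PdDistrict": "MISSION"},
    {"Address": "100 Block of VELASCO AV", "PdDistrict": "INGLESIDE"},
]
mis_incidents = {"Address": "100 Block of VELASCO AV", "PdDistrict": None}

filled = impute_district(mis_incidents, incidents)
incidents.append(filled)  # add the repaired observation back
```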

House Prices Dataset

This dataset is also examined for dimensions, column names, structure and summary statistics. A new dataframe called sf_house_price is created by filtering the house prices for the city of San Francisco. The columns RegionID, State, Metro, CountyName and SizeRank are removed, as only the columns City, RegionName and the pricepersqft values for all the months and years are required. The data is in wide format, where 261 of the 263 columns hold the price per sqft for a given month and year. It is converted into long format with 4 columns, namely RegionName, City, yearmonth and pricepersqft. The yearmonth column is then separated into 2 columns, year and month. The column RegionName is also renamed to the more descriptive zipcode.
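
The wide-to-long conversion can be sketched as follows. This is a Python illustration rather than the R code used in the project; the function wide_to_long, the 'YYYY-MM' column naming and the toy prices are assumptions made for the example.

```python
def wide_to_long(wide_rows, id_cols=("zipcode", "City")):
    """Turn one row per region into one row per region and month."""
    long_rows = []
    for row in wide_rows:
        for col, value in row.items():
            if col in id_cols:
                continue
            year, month = col.split("-")  # assumes 'YYYY-MM' column names
            long_rows.append({
                **{c: row[c] for c in id_cols},
                "year": int(year),
                "month": int(month),
                "pricepersqft": value,
            })
    return long_rows

# Toy wide-format row with made-up prices (not real Zillow values)
wide = [{"zipcode": "94102", "City": "San Francisco", "2017-01": 100, "2017-02": 110}]
long_rows = wide_to_long(wide)
```

Each month column becomes its own row, so a table with 261 month columns yields 261 rows per zipcode.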

Merge Incidents and House prices Data

The zipcode library is used to combine the incidents and sf_house_price dataframes. The zipcode data in this library is used to obtain the zipcode corresponding to the latitude and longitude columns in the incidents dataset. This new column is later used as a key to join incidents and sf_house_price. The procedure for merging the two dataframes is described below.

A new dataframe called zipcode_ca is created by filtering the zipcode for California state.

To obtain the zipcode corresponding to a given latitude and longitude in the incidents dataset, a function called get_zipcodes is created. This function takes a latitude and longitude as input, matches them against the latitudes and longitudes in the zipcode_ca data and returns the corresponding zipcode. The input values will not always match an entry in zipcode_ca exactly. For this reason, we calculate the differences between the two latitudes and the two longitudes, and from them the Euclidean distance between the two points. The zipcode of the entry with the minimum distance is then taken as the match.
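
The nearest-match idea can be sketched as below. This is a Python illustration of the approach, not the project's R function; the name nearest_zipcode, the placeholder zip labels and the reference coordinates are hypothetical.

```python
import math

def nearest_zipcode(latitude, longitude, zip_table):
    """Return the zip whose reference point is closest in (lat, lon) space."""
    def distance(entry):
        return math.hypot(entry["latitude"] - latitude,
                          entry["longitude"] - longitude)
    return min(zip_table, key=distance)["zip"]

# Placeholder reference points; real ones would come from the zipcode data
zipcode_ca = [
    {"zip": "zip-A", "latitude": 37.78, "longitude": -122.42},
    {"zip": "zip-B", "latitude": 37.75, "longitude": -122.42},
]

match = nearest_zipcode(37.775, -122.42, zipcode_ca)
```

Euclidean distance on raw degrees is an approximation, since a degree of longitude covers less ground than a degree of latitude at San Francisco's latitude, but it is adequate for choosing among nearby reference points.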

As mentioned earlier, the incidents dataframe has ~2.18 million records. Moreover, many values in the columns Address, longitude and latitude are repeated. Hence, we create a temporary dataframe called incidents_loc, which contains only the distinct combinations of these columns and reduces the processing time considerably. incidents_loc has only ~77000 records.

The function get_zipcodes is applied to the incidents_loc dataframe and a new column zipcode is added to it. The updated incidents_loc is then joined with incidents to obtain a zipcode column for all ~2.18 million observations.
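
The saving from working on distinct locations can be illustrated with a small Python sketch (the project itself used R). The function add_zipcodes, the stub lookup and the toy coordinates are hypothetical; the stub only stands in for a real coordinate-to-zipcode lookup and records how often it is called.

```python
def add_zipcodes(incidents, lookup):
    """Resolve each distinct location once, then copy the result to every row."""
    distinct_locs = {(r["latitude"], r["longitude"]) for r in incidents}
    zip_by_loc = {loc: lookup(*loc) for loc in distinct_locs}
    return [
        {**r, "zipcode": zip_by_loc[(r["latitude"], r["longitude"])]}
        for r in incidents
    ]

lookup_calls = []

def stub_lookup(latitude, longitude):
    """Placeholder lookup that records each call it receives."""
    lookup_calls.append((latitude, longitude))
    return "zip-A" if latitude > 37.77 else "zip-B"

incidents = [
    {"latitude": 37.78, "longitude": -122.41},
    {"latitude": 37.78, "longitude": -122.41},  # repeated location
    {"latitude": 37.76, "longitude": -122.42},
]
with_zip = add_zipcodes(incidents, stub_lookup)
```

Here three incident rows trigger only two lookups, which mirrors how ~77000 distinct locations stand in for ~2.18 million rows.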

This zipcode column is used to join incidents with sf_house_price. The resulting incidents_house_price dataframe now contains the house prices corresponding to each incident location by year and month.

incidents_house_price

Before proceeding to exploratory data analysis, some changes are made to the incidents_house_price dataframe. The column Category, as mentioned before, has 39 distinct crime categories, as shown below.

To simplify the analysis, categories covering similar crime types are combined. This reduces the number of categories from 39 to 15. The resulting new categories are stored in a dataframe called Categories. The table below lists the old categories alongside their new, common group names.

The Categories and incidents_house_price are joined to create a new dataframe incidents_new_categories which now has a column for new category name. The columns New_Category, PdDistrict, month, DayOfMonth, Time and DayOfWeek are converted into factors. This is the file used in the rest of the analysis that follows. A glimpse of this dataset is given below.

## Observations: 2,175,687
## Variables: 17
## $ Category     <chr> "NON-CRIMINAL", "ROBBERY", "ASSAULT", "SECONDARY ...
## $ Descript     <chr> "LOST PROPERTY", "ROBBERY, BODILY FORCE", "AGGRAV...
## $ DayOfWeek    <fctr> Monday, Sunday, Sunday, Sunday, Tuesday, Sunday,...
## $ month        <fctr> 1, 2, 2, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, ...
## $ DayOfMonth   <fctr> 19, 1, 1, 1, 27, 1, 31, 31, 31, 31, 1, 1, 1, 1, ...
## $ year         <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2...
## $ Time         <fctr> 14, 15, 15, 15, 19, 16, 21, 21, 16, 17, 14, 14, ...
## $ PdDistrict   <fctr> MISSION, TENDERLOIN, TENDERLOIN, TENDERLOIN, NOR...
## $ Resolution   <chr> "NONE", "NONE", "NONE", "NONE", "NONE", "NONE", "...
## $ Address      <chr> "18TH ST / VALENCIA ST", "300 Block of LEAVENWORT...
## $ longitude    <dbl> -122.4216, -122.4144, -122.4144, -122.4144, -122....
## $ latitude     <dbl> 37.76170, 37.78419, 37.78419, 37.78419, 37.80047,...
## $ Location     <chr> "(37.7617007179518, -122.42158168137)", "(37.7841...
## $ zipcode      <int> 94114, 94102, 94102, 94102, 94123, 94118, 94124, ...
## $ pricepersqft <int> 1022, 893, 893, 893, 1173, 918, 516, 516, 1022, 8...
## $ New_Category <fctr> NON-CRIMINAL, BURGLARY, ASSAULT, OTHER OFFENSES,...
## $ Resolved     <fctr> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, ...
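
The regrouping of categories amounts to a lookup from old to new names, which the following Python sketch illustrates (the project itself used R). The three pairs are read off the first rows of the two glimpses above; the full 39-to-15 mapping is not reproduced, and the function add_new_category is hypothetical.

```python
# Old -> new category pairs visible in the first rows of the glimpses
categories = {
    "NON-CRIMINAL": "NON-CRIMINAL",
    "ROBBERY": "BURGLARY",
    "ASSAULT": "ASSAULT",
}

def add_new_category(rows, mapping):
    """Attach New_Category to each row; unmapped categories become None."""
    return [{**row, "New_Category": mapping.get(row["Category"])} for row in rows]

rows = [{"Category": "ROBBERY"}, {"Category": "ASSAULT"}]
incidents_new_categories = add_new_category(rows, categories)
```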

Exploratory Data Analysis

Density of crimes by category

To visualize the density of crimes by category on a map without much loss of information, only a subset of incidents_new_categories is taken. This subset, called filt_zipc, retains the columns zipcode, New_Category, latitude and longitude. As this is a large dataframe, we consider only those zipcode areas where more than 10 crimes have occurred. The number of crimes for each category is shown on a map of San Francisco (using the leaflet library). This provides a spatial visualization of areas with a high concentration of crimes by crime type. The tooltip shows the category followed by the number of crimes in parentheses.

One observes a higher density of crimes in the north-east part of the map. This region belongs to the SOUTHERN, MISSION, CENTRAL, BAYVIEW and NORTHERN police districts. One possible explanation for the high number of crimes is the high population density. There is also more opportunity because of the large number of tourists in this area. The south-east region has a much lower crime density. Note, however, that there are four hot-spots near the south.

In addition to the plot above, it is also instructive to visualize crimes by category. As an illustration, the plots below show the distribution of crimes for the top 2 (Theft and Arson) and bottom 2 (Sexual Offenses and Weapon) categories. The high larceny/theft rate in the city area is probably due to the heavy use of public transportation, which makes it convenient for thieves to find targets. As mentioned previously, the large number of tourists visiting the area are also easy victims. All other crimes can be explained similarly (as a pedantic exercise) but are not pursued further in this document.

Number of crimes for each category

The bar plot below provides a relative comparison of the number of crimes for each category. Crimes are summed over all the years. We observe that from 2003 to 2017, Theft, Arson, Assault and Burglary were (and remain) major concerns for the San Francisco Police Department. The Other Offenses and Non-criminal categories, which group several smaller non-violent incident types, also make a significant contribution.

The plot shows a year-wise trend of Theft, Arson, Assault and Burglary. It can be seen that there hasn’t been any significant decrease in the number of crimes across years.

Incidents by month and year

The heatmap below shows the number of crimes by month and year. It allows for easy visualization of hot-spots, i.e., month-year combinations with high crime rates. One observes an increase in crimes from 2013 - 2017 compared to the previous years. Crimes are lower during February, November and December. For the winter months of November and December, the lower population density due to holidays and vacations could explain the lower crime rates. Tourism is also low around this time. In contrast, the warmer months of March to October see a relatively larger number of crimes, probably due to the correspondingly higher population density and hence opportunity.

Trend of incidents vs time

The number of incidents for the top six categories occurring during a 24-hour period is shown in the plot below. The Time axis corresponds to the 24-hour clock time. Each datapoint corresponds to the total of all the crimes between the years 2003 - 2017, for the corresponding category and time. The trends in crimes can be divided into 3 distinct time slots:

  1. 3 a.m. - 7 a.m.: Has a lower number of crimes than other times of the day, most likely because the majority of people are at home during this time. Therefore, there is less opportunity. After 7 a.m. there is a gradual increase in the crime rate.
  2. 10 a.m. - 1 p.m.: Has a peak at 12 p.m., most likely corresponding to lunch break hours for most organisations.
  3. 5 p.m. - 12 a.m.: Has a peak at 6 p.m., probably because of the increase in population due to people returning from work.

For each datapoint in the plot above, the plot below provides the breakdown by year. One observes the following:

  1. After 2005, there has been a drastic decrease in the cumulative count of vehicle thefts throughout the day, from ~18000 to ~8000. Nevertheless, the trend for increased thefts after 3pm (“increased” relative to other times during the 24-hour period) is consistent year over year.
  2. After 2009, drug and alcohol cases have seen a reduction from ~13000 to ~10000 cases, although this decrease has been inconsistent year over year.
  3. There has been a drastic increase in the number of thefts every year since 2011, with approximately 5000 cases added every year.

Number of incidents resolved by PD district

The bar plot below shows the number of incidents resolved for each PD district. As seen above in the leaflet map, the SOUTHERN, MISSION, CENTRAL and NORTHERN regions have the largest number of crimes, but the SFPD has not been able to resolve most of these cases. Tenderloin is the only police district where the number of resolved cases exceeds the number of unresolved cases.

Statistical Analysis

Number of crimes for each category

The bar plot below shows the number of crimes for each category stacked by police district. THEFT is the largest crime category with ~475000 incidents in total. The largest contributors, in order of the number of theft incidents, come from the SOUTHERN, NORTHERN, CENTRAL, MISSION, and BAYVIEW police districts. OTHER_OFFENSES accounts for ~410000 incidents, with the distribution across police districts following the same trend as in THEFT. Individual contributions from the remaining categories are ~240000 each, less than 50% compared to THEFT, but the relative contributions from each police district are similar to those of THEFT in most cases.

Number of crimes by month

The box plot below shows the distribution of the total number of crimes n, by month, between the years 2003 and 2017. The dashed line is the median of n. The bottom and top solid lines are the first and third quartiles of n respectively. The crime distributions for February and December lie below the first quartile, implying lower crime rates than in the other months. October, on the other hand, has a higher number of crimes, as it lies above the third quartile. The remaining months fall between the first and third quartiles, and are therefore representative samples of n. One also notes two outliers, in the months of January and December.

Number of crimes by week

The box plots below show the distribution of the total number of crimes n, by week, across all the years by crime category. The dashed line is the median of n; the bottom and top solid lines are the first and third quartiles of n respectively for each category. If all crimes are considered together, the trend in the plot is dominated by, and follows, the trend of theft. To gain better insight into the trends for each category, the plot is faceted by New_Category. This gives a different picture. One notes that the days of the week on which the number of crimes lies above or below the quartile lines (solid lines) vary according to the crime type. Each warrants a socio-economic explanation which is beyond the scope of this document.

Number of crimes by time of day

The box plot below shows the trend in the number of crimes by time of day. As is evident, there is a significant decrease between 3am and 6am, and a substantial increase in the evening from 4pm to 7pm. There is also a high number of crimes at lunch time, around 12pm. Explanations are analogous to those provided in the Trend of incidents vs time section.

Density of crime by area

The map below shows the distribution of crimes in each police district. One notices that the BAYVIEW, CENTRAL, MISSION, NORTHERN and SOUTHERN districts have a higher density of crime, whereas the density in the remaining districts is comparatively low. The police district borders were drawn using a shapefile downloaded from the data.sfgov.org website.

## OGR data source with driver: ESRI Shapefile 
## Source: "Current-Police-Districts", layer: "geo_export_222db914-0d75-42f9-9a5b-a7f0a125858c"
## with 10 features
## It has 5 fields

Relationship between incidents and house price

The correlation plot below shows a high correlation between price per sqft and year, but a low correlation between price per sqft and the number of incidents in an area. This addresses the second problem statement, whether crime rates affect property prices in San Francisco: the two are essentially uncorrelated, which suggests that the number of crime incidents in an area has hardly any effect on property prices.
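
For reference, the quantity behind such a correlation plot is typically the Pearson correlation coefficient, which is assumed here. The Python sketch below (an illustration only; the project used R) computes it on made-up numbers that merely mimic a strong and a weak relationship; none of the values are taken from the project data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up numbers for illustration only
years = [2013, 2014, 2015, 2016, 2017]
price = [700, 800, 900, 1000, 1100]        # rises steadily with year
incident_counts = [120, 80, 150, 90, 130]  # no clear pattern

r_year = pearson(years, price)
r_incidents = pearson(incident_counts, price)
```

The coefficient captures only linear association, so a value near zero rules out a linear relationship but not every possible form of dependence.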

Machine Learning

We use the incidents_new_categories dataframe to build a logistic regression model that predicts whether an incident will be resolved by the police department, given its category (first problem statement). The response variable is derived from the column Resolution: it is coded as 0 if the resolution provided is “NONE” and 1 for all other values, and saved in a new column called Resolved. The column Resolved is then converted into a factor.

The ratio of unresolved to resolved cases is 1.6:1, indicating that this is a fairly balanced dataset. Moreover, the large number of observations helps ensure that the training samples span the entire feature space, resulting in a robust model.

The dataset is partitioned into training and testing sets. The training set contains 70% and the testing set contains 30% of the data. The training set is used to build a logistic regression model to predict the response, Resolved, as a function of five predictors namely, New_Category, PdDistrict, month, DayOfMonth and Time.
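
The response coding and the 70/30 partition can be sketched as follows, in Python for illustration only (the project used R). The function names, the random seed and the toy data are hypothetical, and 'ARREST, BOOKED' serves only as an example of a value other than NONE.

```python
import random

def code_resolved(resolution):
    """Return 0 if the resolution is 'NONE', otherwise 1."""
    return 0 if resolution == "NONE" else 1

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy data: 100 rows with a mix of unresolved and resolved incidents
resolutions = ["NONE", "ARREST, BOOKED", "NONE", "NONE", "ARREST, BOOKED"] * 20
rows = [{"Resolved": code_resolved(r)} for r in resolutions]

train, test = train_test_split(rows)
```

Shuffling before the cut keeps both parts representative of the whole; fixing the seed makes the partition reproducible.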

The glm function is used with a binomial family (error distribution) since the response is binary. Results from the model are shown below.

The estimates from the model describe the relationship between the predictor variables and the response. The coefficients are on the logit (log-odds) scale. For easier interpretation they are converted into odds ratios, as shown in the image below, by taking the exponential of the coefficient of each predictor. It can be seen that Theft and Vehicle Theft have the least chance of being resolved by the police. From the odds-ratio table below, the odds of the SFPD resolving a Theft or Vehicle Theft case are only 0.284 and 0.275 times the odds for the reference category, whereas Warrants has an odds ratio of 46.934. For the police districts Southern, Mission and Northern, where the density of crime is high, the odds ratios are 1.205, 1.279 and 0.913 respectively, considerably lower than that of Tenderloin, which has the highest odds ratio of 2.157. The odds ratios for DayOfMonth, Time and Month are 1.001, 1.004 and 0.99 respectively, i.e., close to 1, so these predictors change the odds of resolution only slightly. It can be noted that all five predictors are significant in this model.
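
The link between a coefficient and the quantity obtained by exponentiating it can be made concrete with a small Python sketch (an illustration only; the project used R). Exponentiating a logistic-regression coefficient gives an odds ratio. The coefficient below is reconstructed from the value 0.284 reported for Theft above, and the baseline odds of 1.0 are an assumption chosen purely for the example.

```python
import math

def odds_ratio(coefficient):
    """Convert a logistic-regression coefficient (log-odds) to an odds ratio."""
    return math.exp(coefficient)

def probability(odds):
    """Convert odds to a probability."""
    return odds / (1 + odds)

# Coefficient reconstructed from the value 0.284 reported for Theft
theft_coefficient = math.log(0.284)
theft_or = odds_ratio(theft_coefficient)

# Assumed baseline odds of 1.0 (a 50% chance) for the reference category
baseline_odds = 1.0
theft_probability = probability(baseline_odds * theft_or)
```

Under that assumed baseline, an odds ratio of 0.284 corresponds to a probability of resolution of roughly 0.22, compared with 0.5 for the reference category.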

Various models were built using different sets of predictors. AIC was used as the metric to determine the best model. The model obtained above had the lowest AIC and was therefore chosen as the champion model. Among models fitted to the same data, the one with the lower AIC is preferred, because AIC rewards goodness of fit while penalizing the number of parameters.
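
As a reminder of what is being compared, AIC is defined as 2k - 2 ln(L), where k is the number of estimated parameters and L the maximized likelihood. The Python sketch below (an illustration only) applies this to three hypothetical candidate models; the parameter counts and log-likelihoods are invented and are not the values from this project.

```python
def aic(num_parameters, log_likelihood):
    """Akaike information criterion: 2k - 2*ln(L)."""
    return 2 * num_parameters - 2 * log_likelihood

# Hypothetical candidates: (number of parameters, maximized log-likelihood)
candidates = {
    "category only": (15, -1250.0),
    "category + district": (24, -1200.0),
    "all five predictors": (27, -1195.0),
}

scores = {name: aic(k, ll) for name, (k, ll) in candidates.items()}
best = min(scores, key=scores.get)  # the candidate with the lowest AIC
```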

This model is now applied to the testing data and its performance is evaluated. Results of the confusion matrix are shown below. The probability threshold is set to 0.5. A sensitivity of 0.80 and a specificity of 0.69 are achieved, along with an accuracy of 0.766.
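
How these three metrics follow from thresholded predictions can be sketched in Python (an illustration only; the project used R). The labels and predicted probabilities are toy values, and treating Resolved = 1 as the positive class is an assumption.

```python
def confusion_metrics(y_true, y_prob, threshold=0.5):
    """Sensitivity, specificity and accuracy after thresholding probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return {
        "sensitivity": tp / (tp + fn),   # share of positives that are caught
        "specificity": tn / (tn + fp),   # share of negatives that are caught
        "accuracy": (tp + tn) / len(pairs),
    }

# Toy labels and predicted probabilities (not project results)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1, 0.45, 0.55]

metrics = confusion_metrics(y_true, y_prob)
```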

The receiver operating characteristic (ROC) curve is a measure of classifier performance. The ROC curve is generated by plotting the true positive rate against the false positive rate (plot below), and the area under the curve (AUC) is calculated from it. The AUC for this model is 0.745. Hence, the model does a fairly good job of discriminating between the two categories of the response variable.
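
One way to build intuition for the AUC is its pairwise interpretation: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The Python sketch below (an illustration only, not the routine used in the project) computes the AUC this way on toy data, again assuming Resolved = 1 is the positive class.

```python
def auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy labels and scores (not project results)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1, 0.45, 0.55]

area = auc(y_true, y_score)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported AUC should be read.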

Recommendation

Based on the analysis, the following recommendations can be made:

  1. To the SFPD: Of the 10 police districts across San Francisco, Tenderloin is the only one where the number of resolved cases exceeds the number of unresolved cases. This suggests that the other police districts could benefit from adopting the specific strategies successfully employed by Tenderloin, which could help reduce the number of incidents that are difficult to resolve.

  2. To the city administration: The warmer months of March to October see a relatively larger number of incidents in San Francisco. This is likely due to the increased population density during this period. Therefore, taking extra security measures and deploying more police forces around this time could reduce incidents.

  3. To the resident population: It is not only the responsibility of the SFPD and the city administration, but also of the general population, to deter crimes by being vigilant and cautious, especially in areas covered by the Bayview, Central, Mission, Northern and Southern police districts.